The document reads .ris bibliographic files, filters the selected studies, and categorises data sources into Articles, Packages, and Kaggle. Dataset visualisations summarise population, sport type, data type, and geographic distribution. A final evaluation scores all datasets against predefined criteria and compares their suitability for generating synthetic datasets using Statistical and/or GAN-based approaches.

The aim is to compile, categorise, and visualise publicly available sports datasets from bibliographic and online sources, and to evaluate them with a scoring framework derived from the literature review, in order to determine which datasets are best suited to Statistical or GAN-based modelling when developing synthetic datasets.
1. Read bibliographic data from .ris files and filter relevant studies through manual and Shiny-based screening.
2. Create a combined dataset incorporating Articles, Packages, and Kaggle entries.
3. Generate summary statistics and interactive visualizations:
   - Scatter and bar plots showing sample size, population, sport type, and data type.
   - Global maps and stacked bar plots showing dataset distribution by country.
   - Sankey diagram linking datasets, sports, and collected variables.
4. Define a scoring system for dataset quality across eight evaluation criteria.
5. Rank all datasets by total score and visualise the results across Statistical and GAN-based categories.
The analysis shows that GAN-based datasets rely primarily on video and image data focused on athlete performance and activity detection, while Statistical datasets encompass tabular (player and game statistics), physiological, and survey-based data. The evaluation criteria highlight the datasets most suitable for each approach, supporting the selection of appropriate data sources for developing synthetic datasets with Statistical or GAN-based methods.
| Category | Number of Datasets | Data Types | Population | Most Frequent Sports | Top 3 (by Score) |
|---|---|---|---|---|---|
| GAN-based | 16 | Video, Image | Athlete | Multiple, Basketball, Fitness | TeamTrack, C-Sports, SportsMOT |
| Statistical | 34 | Tabular, Physiological, Medical Record, Survey, Accelerometer | Athlete, Multiple | Football, Baseball, Basketball, Fitness | MTS-5, NCAA-ISP, LLBD |
## [1] "ebscoSport.ris" "ieee.ris" "qut.ris"
## [4] "scienceDirect.ris" "springerNature.ris" "wos.ris"
# The Web of Science export contains empty lines; remove them before parsing
wos <- readLines("data/database/ris/wos.ris")
wos <- wos[wos != ""]
writeLines(wos, "data/database/ris/wos.ris")
# Read all files as a list and convert to a data frame
files <- list.files("data/database/ris", pattern = "\\.ris$", full.names = TRUE)
bibliography <- read_bibliography(filename = files, return_df = TRUE)
bibliography
# Title preparation
bibliography$titleLower <- tolower(bibliography$title)
bibliography$titleLower <- strip(bibliography$titleLower, apostrophe.remove = TRUE)
head(bibliography$titleLower)
## [1] "secondary prevention of musculoskeletal sports injuries a scoping review of early detection and early intervention strategies"
## [2] "the effects of rule changes in footballcode team sports a systematic review"
## [3] "how physical education teachers are positioned in models scholarship a scoping review"
## [4] "physical education from lgbtq students perspective a systematic review of qualitative studies"
## [5] "the altmetric score has a stronger relationship with article citations than journal impact factor and open access status a crosssectional analysis of sport sciences articles"
## [6] "methods of the national collegiate athletic association injury surveillance program â through â"
## [1] "crosssectional and longitudinal associations of active travel organised sport and physical education with accelerometerassessed moderatetovigorous physical activity in young people the international childrenâs accelerometry database"
## [2] "match score dataset for team ball sports"
## [3] "collective sports a multitask dataset for collective activity recognition"
## [4] "tgc reid a dataset for sport event reidentification in the wild"
## [5] "regular sports services dataset of demographic frequency and service level agreement"
## [6] "aspset an outdoor sports pose video dataset with d keypoint annotations"
## [7] "dataset for the analysis of tv viewer response to live sport broadcasts and sponsor messages"
## [8] "sports work strategy of college counselors based on mysql database big data analysis"
## [9] "epidemiology of testicular trauma in sports analysis of the national electronic injury surveillance system database"
## [10] "administrative databases used for sports medicine research demonstrate significant differences in underlying patient demographics and resulting surgical trends"
## [11] "analysis of research trends on elbow pain in overhead sports a bibliometric study based on web of science database and vosviewer"
## [12] "the racial and sexual differences in emergency department visits for sportrelated spine fracture injuries a neiss database study"
## [13] "comprehensive dataset on presarscov infection sportsrelated physical activity levels disease severity and treatment outcomes insights and implications for covid management"
## [14] "analysis of a comprehensive dataset influence of vaccination profile types and severe acute respiratory syndrome coronavirus reinfections on changes in sportsrelated physical activity one month after infection"
# Remove duplicated titles, keeping the first unique entry
bibliography <- bibliography[!duplicated(bibliography$titleLower), ]
# Check that duplicates are gone
any(duplicated(bibliography$titleLower))
## [1] FALSE
## [1] 278 104
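The normalise-then-deduplicate step above can be illustrated on toy data. This is a minimal base-R sketch, not the report's own code: `normalise_title` is a hypothetical helper that only approximates the `tolower()` plus `strip()` preparation, and the three titles are made-up variants for illustration.

```r
# Toy titles; in the report the real ones come from bibliography$title.
titles <- c(
  "Match Score Dataset for Team Ball Sports",
  "Match score dataset for team ball sports.",
  "ASPset: An Outdoor Sports Pose Video Dataset"
)

# Hypothetical helper: lower-case, drop everything except letters and
# spaces, then collapse repeated whitespace.
normalise_title <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z ]", "", x)
  x <- gsub(" +", " ", x)
  trimws(x)
}

title_key <- normalise_title(titles)

# duplicated() flags the second and later occurrences, so negating it
# keeps the first record for each normalised title.
keep <- !duplicated(title_key)
unique_titles <- titles[keep]
```

Comparing on the normalised key rather than the raw title is what lets records that differ only in case or punctuation collapse into one entry.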
The dataset is then filtered to keep only the articles selected during
screening, reducing the number of records from 278 to 89.
bibliographyRev <- read.csv("data/database/bibliography/bibliographyRev.csv")
bibliographyRev <- bibliographyRev %>%
filter(screened_abstracts == "selected") %>%
dplyr::select(author, title, year, keywords, abstract, doi, titlelower,
filename)
# write.csv(bibliographyRev, "data/database/bibliography/bibliographyRevSelected.csv",
# row.names = FALSE)
dim(bibliographyRev)
## [1] 89 8
## [1] "author" "title" "year" "keywords" "abstract"
## [6] "doi" "titlelower" "filename"
From the output file above, an Excel file was created manually to categorise the databases into Articles (sheet = databaseAR), Packages (R and Python) (sheet = databasePA), and Kaggle (sheet = databaseOT).
- Articles: Databases were searched for publicly available datasets using the keywords “sport” AND “database” or “sport” AND “dataset”.
- Packages: Only actively maintained packages containing athlete-related datasets were included.
- Kaggle: In the datasets category, the keywords used were “injuries”, “sport”, “NFL”, and “AFL”. In the competitions category, only “sport” was used. For both categories, only the top 10 datasets were reviewed.
## [1] "bibliographyRevSelected" "databaseAR"
## [3] "databasePA" "databaseOT"
## [5] "database" "summary"
## [7] "rank"
The database sheet contains the merged data from all
files, and the summary sheet will be used to generate
insights and visualisations.
# Read the summary sheet
summary <- read_excel("data/database/bibliography/bibliographyRevSelected.xlsx",
sheet = "summary")
summary
colorPalette <- RColorBrewer::brewer.pal(8, "Set2")
f1 <- plot_ly(summary,
x = ~PopulationType, y = ~SampleOverall,
type = 'scatter', mode = 'markers',
color = ~PopulationType, colors = colorPalette,
size = ~SampleOverall, sizes = c(10, 60),
marker = list(opacity = 0.7, line = list(width = 1, color = '#333')),
hoverinfo = 'text',
text = ~paste('Dataset:', DatasetName,
'<br>Samples:', SampleOverall,
'<br>Population:', PopulationType),
showlegend = FALSE)
f2 <- plot_ly(summary %>% count(SportType),
x = ~SportType, y = ~n, type = 'bar',
color = ~SportType, colors = colorPalette,
showlegend = FALSE)
f3 <- plot_ly(summary %>% count(DataTypeRaw),
x = ~n, y = ~reorder(DataTypeRaw, n),
type = 'bar', orientation = 'h',
color = ~DataTypeRaw, colors = colorPalette,
showlegend = FALSE)
f4 <- plot_ly(summary,
x = ~DataTypeRaw, y = ~SampleOverall,
type = 'scatter', mode = 'markers',
color = ~ValidData, colors = c('#E15759', '#59A14F'),
size = ~SampleOverall, sizes = c(10, 50),
marker = list(opacity = 0.7),
hoverinfo = 'text',
text = ~paste('Dataset:', DatasetName,
'<br>Type:', DataTypeRaw,
'<br>Valid:', ValidData,
'<br>Samples:', SampleOverall))
fig <- subplot(f1, f2, f3, f4, nrows = 2, margin = 0.20) %>%
layout(
plot_bgcolor = "rgba(0,0,0,0)",
paper_bgcolor = "rgba(0,0,0,0)",
showlegend = TRUE,
legend = list(orientation = "h", x = 0.55, y = -0.15),
annotations = list(
list(text = "Sample Size by Population Type",
x = 0.20, y = 1.05, showarrow = FALSE,
xref='paper', yref='paper', font=list(size=14)),
list(text = "Datasets by Sport Type",
x = 0.80, y = 1.05, showarrow = FALSE,
xref='paper', yref='paper', font=list(size=14)),
list(text = "Data Type Distribution", x = 0.20, y = 0.47,
showarrow = FALSE, xref='paper', yref='paper', font=list(size=14)),
list(text = "Sample Size vs Data Type (by Validity)",
x = 0.80, y = 0.47, showarrow = FALSE, xref='paper',
yref='paper', font=list(size=14))
)
)
fig
# Expand the Country column so that each dataset-country pair gets its own row;
# a dataset covering several countries appears once per country
summaryMap <- summary %>%
mutate(Country = str_split(Country, ",")) %>%
unnest(Country) %>%
mutate(Country = str_trim(Country))
summaryMap
# Generate the information to display in the map
countrySummary <- summaryMap %>%
group_by(Country) %>%
summarise(
nDatasets = n(),
datasets = paste(unique(column), collapse = "; "),
studyDesigns = paste(unique(StudyDesign), collapse = "; "),
# Use the same numeric column for both ends of the range
sampleRange = paste0("Min: ", min(SampleOverall, na.rm = TRUE),
" | Max: ", max(SampleOverall, na.rm = TRUE)),
population = paste(unique(PopulationType), collapse = "; "),
sex = paste(unique(PopulationSex), collapse = "; "),
sports = paste(unique(SportType), collapse = "; "),
reference = paste(unique(ReferenceURL), collapse = "; ")
)
countrySummary
# Create hover text with the information above
countrySummary <- countrySummary %>%
mutate(hoverText = paste0(
"<b>", Country, "</b><br>",
"Datasets: ", nDatasets, "<br>",
"Study Design: ", studyDesigns, "<br>",
"Sample Range: ", sampleRange, "<br>",
"Population: ", population, "<br>",
"Sex: ", sex, "<br>",
"Sports: ", sports, "<br>",
"Dataset Names: ", datasets, "<br>",
"Reference: ", reference
))
countrySummary
The following map does not display the datasets labelled International (n = 9) or Commonwealth countries (n = 1), since neither label corresponds to a single country.
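To make the row expansion and the exclusion of non-country labels concrete, here is a minimal base-R sketch on toy data. The data frame, its values, and the exact spelling of the two excluded labels are assumptions for illustration; the report itself uses `str_split()` and `unnest()` on the summary sheet.

```r
# Toy stand-in for the summary sheet; the values are illustrative only.
toy <- data.frame(
  column  = c("A", "B", "C"),
  Country = c("Australia, Spain", "International", "Spain"),
  stringsAsFactors = FALSE
)

# One row per dataset-country pair.
parts <- strsplit(toy$Country, ",", fixed = TRUE)
long <- data.frame(
  column  = rep(toy$column, lengths(parts)),
  Country = trimws(unlist(parts)),
  stringsAsFactors = FALSE
)

# Labels that are not country names (assumed spelling), so a choropleth
# keyed on country names has nothing to match them against.
not_countries <- c("International", "Commonwealth countries")
mappable <- long[!long$Country %in% not_countries, ]

# Number of datasets per country.
n_datasets <- table(mappable$Country)
```

Counting after the expansion means a dataset that spans two countries contributes to both of them, which is the behaviour the map relies on.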
# Interactive map
mapP <- plot_ly(
data = countrySummary,
type = "choropleth",
locations = ~Country,
locationmode = "country names",
z = ~nDatasets,
text = ~hoverText,
hoverinfo = "text",
colorscale = "Oranges",
colorbar = list(title = "<b>Number of Datasets</b>")
) %>%
layout(
title = "Global Distribution of Public Sports Datasets",
geo = list(
showframe = FALSE,
showcoastlines = TRUE,
projection = list(type = "mercator"),
bgcolor = "rgba(0,0,0,0)",
domain = list(x = c(0.25, 1), y = c(0.15, 0.85)),
center = list(lon = 5, lat = 20)
),
plot_bgcolor = "rgba(0,0,0,0)",
paper_bgcolor = "rgba(0,0,0,0)"
)
mapP
The next data frame and plot show the datasets by source type:
- Article: Data validated and used in a paper.
- Package: Dataset can be extracted from a CRAN or Python package.
- Kaggle: Available from the website Kaggle.com.
# Generate the plot
barP <- plot_ly(
data = barD,
x = ~nRefs,
y = ~Country,
type = "bar",
orientation = "h",
color = ~sourceType,
colors = c("#87CEEB", "#F4A261", "grey"),
text = ~paste0(nRefs, " ", sourceType),
textposition = "inside",
insidetextanchor = "middle",
textfont = list(color = "black", size = 9, family = "Arial"),
hovertext = ~paste0(Country, ": ", nRefs, " ", sourceType, " datasets"),
hoverinfo = "text",
showlegend = TRUE
) %>%
layout(
barmode = "stack",
title = "Datasets per Country by Source Type",
xaxis = list(title = "Count", showgrid = FALSE),
yaxis = list(title = "", automargin = TRUE),
legend = list(
title = list(text = "<b>Source Type</b>"),
orientation = "h",
x = 0.82, y = 0.05,
bgcolor = "rgba(0,0,0,0)",
bordercolor = "rgba(0,0,0,0)"
),
plot_bgcolor = "rgba(0,0,0,0)",
paper_bgcolor = "rgba(0,0,0,0)"
)
barP
# Prepare the dataset selecting the relevant columns
variables <- summary %>%
select(column, SportType, DataTypeRaw, VariablesCollected, ReferenceURL) %>%
mutate(
sourceType = case_when(
str_detect(ReferenceURL, regex("Kaggle", ignore_case = TRUE)) ~ "Kaggle",
str_detect(ReferenceURL, regex("CRAN|Python", ignore_case = TRUE)) ~ "Package",
TRUE ~ "Article"
),
VariablesCollected = str_replace_all(
VariablesCollected,
regex("•\\s*", ignore_case = TRUE),
"<br>• "
),
VariablesCollected = paste0("<b>Variables:</b>", VariablesCollected)
)
variables
The following plot links three sections: datasets, sports, and data types. Each flow represents a connection between two of these sections and is coloured by its data source type (Kaggle: orange, Package: green, Article: blue). Hover over a flow to see the variables included in that connection.
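The index mapping behind these flows can be sketched on toy data. This is an illustrative base-R example under assumed column names (`dataset`, `sport`, `dtype`) and made-up values; it mirrors the idea of turning labels into zero-based node indices, which is what a Sankey trace expects for `source` and `target`.

```r
# Toy data; names and values are hypothetical.
toy <- data.frame(
  dataset = c("D1", "D2"),
  sport   = c("Football", "Football"),
  dtype   = c("Tabular", "Video"),
  stringsAsFactors = FALSE
)

# Every distinct label becomes one node.
node_names <- unique(c(toy$dataset, toy$sport, toy$dtype))

# match() is 1-based, so subtract 1 to obtain zero-based node indices.
get_index <- function(x) match(x, node_names) - 1

# Each dataset contributes two links: dataset -> sport and sport -> data type.
links <- rbind(
  data.frame(source = get_index(toy$dataset), target = get_index(toy$sport)),
  data.frame(source = get_index(toy$sport),   target = get_index(toy$dtype))
)
```

Because nodes are identified by label alone, a label that appears in more than one column would collapse into a single node, so it is worth checking that dataset, sport, and data type names do not overlap.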
# Create Node List
nodes <- data.frame(
name = unique(c(variables$column, variables$SportType, variables$DataTypeRaw))
)
# Function to map each label to numeric index
get_index <- function(x) match(x, nodes$name) - 1
# Links
links <- bind_rows(
variables %>%
transmute(
source = get_index(column),
target = get_index(SportType),
type = sourceType,
hover = VariablesCollected
),
variables %>%
transmute(
source = get_index(SportType),
target = get_index(DataTypeRaw),
type = sourceType,
hover = VariablesCollected
)
)
color_map <- c(
"Kaggle" = "#FFB347",
"Package" = "#77DD77",
"Article" = "#779ECB"
)
links$color <- color_map[links$type]
# Plotly Sankey
fig <- plot_ly(
type = "sankey",
arrangement = "snap",
node = list(
label = nodes$name,
color = "grey",
pad = 15,
thickness = 20,
line = list(color = "black", width = 0.5)
),
link = list(
source = links$source,
target = links$target,
value = rep(1, nrow(links)),
color = links$color,
customdata = links$hover,
hovertemplate = "%{customdata}<extra></extra>"
))
fig <- fig %>%
layout(
title = list(
text = "Variables Across Sports Datasets",
font = list(size = 18, color = "#333", family = "Roboto")
),
font = list(size = 12),
margin = list(l = 10, r = 10, t = 60, b = 10),
annotations = list(
list(
x = 0.00, y = 1.05,
text = "<b>Datasets</b>",
showarrow = FALSE,
xref = "paper", yref = "paper",
font = list(size = 14, color = "#FFB347", family = "Roboto")
),
list(
x = 0.50, y = 0.76,
text = "<b>Sports</b>",
showarrow = FALSE,
xref = "paper", yref = "paper",
font = list(size = 14, color = "#77DD77", family = "Roboto")
),
list(
x = 0.95, y = 0.75,
text = "<b>Variables</b>",
showarrow = FALSE,
xref = "paper", yref = "paper",
font = list(size = 14, color = "#779ECB", family = "Roboto")
)
)
)
fig
Each dataset is then analysed and scored manually against the criteria in the following table:
criteria <- tribble(
~Criterion, ~Score0, ~Score1, ~Score2, ~Score3, ~Score4, ~Score5, ~Why_it_matters, ~Weight,
"PopulationType", "Undefined or unclear", "Mentioned but not specified", "Defined but non-specific", "Clearly defined target group", "Multiple subgroups", "Representative population", "Scope & external validity", "10%",
"SampleRaw", "<100", "100–499", "500–999", "1k–9k", "10k–99k", "≥100k or continuous", "Statistical power / stability", "15%",
"SportType", "Not stated", "Unclear type", "1 sport", "2–3 sports", "4–6 sports", "Multi-sport", "Cross-sport generalisability", "10%",
"DataTypeRaw", "Derived only", "Simple data", "Data + derived", "Tabular + image/video", "Tabular + image + video", "Tabular + image + video + time-series", "Synthetic realism potential", "20%",
"VariablesCollected", "Few", "Single data", "Metrics + demographics", "Metrics + player + game stats", "Metrics + player + game + context", "Metrics + player + game + context + metadata", "Modeling depth & richness", "30%",
"Documentation", "None", "Minimal", "Variable list", "Readme + schema", "Schema + examples", "Full docs + code", "Reusability & reproducibility", "5%",
"Access", "Payment", "Manual request", "Online request", "Partially open (data)", "Partially open (license)", "Fully open", "Ease of reuse", "5%",
"Data Cleanliness", "Missing values", "Many issues", "Minor errors", "Clean", "Clean + consistent", "Curated / validated", "Preprocessing quality", "5%"
)
knitr::kable(criteria, align = "l", caption = "Dataset Evaluation Criteria")
| Criterion | Score0 | Score1 | Score2 | Score3 | Score4 | Score5 | Why_it_matters | Weight |
|---|---|---|---|---|---|---|---|---|
| PopulationType | Undefined or unclear | Mentioned but not specified | Defined but non-specific | Clearly defined target group | Multiple subgroups | Representative population | Scope & external validity | 10% |
| SampleRaw | <100 | 100–499 | 500–999 | 1k–9k | 10k–99k | ≥100k or continuous | Statistical power / stability | 15% |
| SportType | Not stated | Unclear type | 1 sport | 2–3 sports | 4–6 sports | Multi-sport | Cross-sport generalisability | 10% |
| DataTypeRaw | Derived only | Simple data | Data + derived | Tabular + image/video | Tabular + image + video | Tabular + image + video + time-series | Synthetic realism potential | 20% |
| VariablesCollected | Few | Single data | Metrics + demographics | Metrics + player + game stats | Metrics + player + game + context | Metrics + player + game + context + metadata | Modeling depth & richness | 30% |
| Documentation | None | Minimal | Variable list | Readme + schema | Schema + examples | Full docs + code | Reusability & reproducibility | 5% |
| Access | Payment | Manual request | Online request | Partially open (data) | Partially open (license) | Fully open | Ease of reuse | 5% |
| Data Cleanliness | Missing values | Many issues | Minor errors | Clean | Clean + consistent | Curated / validated | Preprocessing quality | 5% |
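The source does not state how the eight criterion scores are combined into a single total. One plausible reading, given that the weights in the table sum to 100%, is a weighted sum of the 0 to 5 scores. The sketch below illustrates that assumption only; the scores are invented for a hypothetical dataset, and `DataCleanliness` is a shortened key for the "Data Cleanliness" criterion.

```r
# Weights taken from the criteria table, expressed as proportions.
weights <- c(
  PopulationType = 0.10, SampleRaw = 0.15, SportType = 0.10,
  DataTypeRaw = 0.20, VariablesCollected = 0.30,
  Documentation = 0.05, Access = 0.05, DataCleanliness = 0.05
)
stopifnot(isTRUE(all.equal(sum(weights), 1)))

# Hypothetical scores (0-5) for one made-up dataset.
scores <- c(
  PopulationType = 3, SampleRaw = 4, SportType = 2,
  DataTypeRaw = 3, VariablesCollected = 4,
  Documentation = 5, Access = 5, DataCleanliness = 3
)

# Assumed aggregation: weighted sum, aligned by criterion name.
total_score <- sum(weights * scores[names(weights)])
```

Under this reading the total stays on the same 0 to 5 scale as the individual criteria, and VariablesCollected and DataTypeRaw together account for half of it.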
A rank sheet was added to the main file to store the scores. Two columns were created manually: TotalScore, the score assigned to each dataset, and literatureCategory, the category assigned by the literature review analysis (GAN-based or Statistical).
# Select the variables of interest
summaryScore <- summary %>%
select(column, TotalScore, literatureCategory, ValidData, PopulationType,
SportType, DataTypeRaw)
summaryScore
# Rename the columns
summaryScore <- summaryScore %>%
rename(dataset = column,
group = literatureCategory,
value = TotalScore) %>%
mutate(group = as.factor(group)) %>%
arrange(group, desc(value))
# Create one data frame per approach so each can be plotted separately;
# hovering over a bar shows that dataset's details
statsD <- filter(summaryScore, group == "Statistical")
ganD <- filter(summaryScore, group == "GAN-based")
colorPalette <- setNames(
colorRampPalette(RColorBrewer::brewer.pal(min(max(length(unique(summaryScore$DataTypeRaw)), 3), 8),
"Set2"))(length(unique(summaryScore$DataTypeRaw))),
unique(summaryScore$DataTypeRaw))
stat <- plot_ly(statsD,
x = ~value,
y = ~reorder(dataset, value),
type = 'bar',
orientation = 'h',
color = ~DataTypeRaw,
colors = colorPalette,
hoverinfo = 'text',
marker = list(line = list(width = 1.5)),
text = ~paste(
"<b>Dataset:</b>", dataset,
"<br><b>Value:</b>", round(value, 3),
"<br><b>ValidData:</b>", ValidData,
"<br><b>Population:</b>", PopulationType,
"<br><b>Sport:</b>", SportType,
"<br><b>Data Type:</b>", DataTypeRaw
)) %>%
layout(title = "",
xaxis = list(title = ""),
# tickmode and automargin are axis attributes, so they belong inside yaxis
yaxis = list(title = "Statistical", tickmode = "array", automargin = TRUE))
gan <- plot_ly(ganD,
x = ~value,
y = ~reorder(dataset, value),
type = 'bar',
orientation = 'h',
color = ~DataTypeRaw,
colors = colorPalette,
hoverinfo = 'text',
marker = list(line = list(width = 1.5)),
text = ~paste(
"<b>Dataset:</b>", dataset,
"<br><b>Value:</b>", round(value, 3),
"<br><b>ValidData:</b>", ValidData,
"<br><b>Population:</b>", PopulationType,
"<br><b>Sport:</b>", SportType,
"<br><b>Data Type:</b>", DataTypeRaw
)) %>%
layout(title = "",
xaxis = list(title = ""),
# tickmode and automargin are axis attributes, so they belong inside yaxis
yaxis = list(title = "GAN-based", tickmode = "array", automargin = TRUE))
subplot(stat, gan, nrows = 2, shareX = FALSE, titleY = TRUE) %>%
layout(title = "Ranking of Datasets by Approach (Statistical vs GAN-based)",
showlegend = TRUE)
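The "Top 3 (by Score)" column in the overview table corresponds to taking the three highest-scoring datasets within each category. A minimal base-R sketch of that selection follows; the data frame and its scores are invented for illustration, and the column names simply follow the renamed `dataset`, `group`, and `value` columns.

```r
# Toy scores; dataset names and values are hypothetical.
toy <- data.frame(
  dataset = c("a", "b", "c", "d", "e", "f"),
  group   = c("Statistical", "Statistical", "Statistical",
              "Statistical", "GAN-based", "GAN-based"),
  value   = c(3.1, 4.2, 2.0, 3.8, 3.5, 4.0),
  stringsAsFactors = FALSE
)

# Sort by category, then by descending score.
ranked <- toy[order(toy$group, -toy$value), ]

# Keep at most three rows per category; groups with fewer rows are kept whole.
top3 <- do.call(rbind, lapply(split(ranked, ranked$group), head, n = 3))
```

Sorting before splitting keeps the within-group order, so `head()` returns the highest scores first for each category.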